{janitor}

Caroline Kostrzewa, Sabrina Lin

Topic Tuesdays: 2024-10-01

FYI

  • Using three data sets which we have “messed up” in various ways (all from Kaggle)
    • Species
    • Characters
    • Films


* chisq and fisher test function (talk about masking)

Species

id name classification designation average_height skin_colors hair_colors eye_colors average_lifespan language homeworld
1 Human Mammal Sentient 1.80 Light, Dark Various Various 79 Galactic Basic Various
2 Yoda's species Unknown Sentient 0.66 Green White Brown 900 Galactic Basic Unknown
3 Wookiee Mammal Sentient 2.28 Brown Brown Blue 400 Shyriiwook Kashyyyk
4 Gungan Amphibian Sentient 1.96 Orange None Orange 70 Gungan Naboo
5 Twi'lek Mammal Sentient 1.80 Blue, Green, Red, Yellow None Various 80 Twi'leki Ryloth
6 Mon Calamari Amphibian Sentient 1.70 Orange None Red 70 Mon Calamarian Mon Cala
7 Rodian Reptilian Sentient 1.70 Green None Black 78 Rodese Rodia
8 Droid Artificial Non-sentient
Various None Various
Binary None
9 Trandoshan Reptilian Sentient 2.00 Green None Red 70 Dosh Trandosha
10 Kel Dor Mammal Sentient 1.88 Orange None Black 70 Kel Dor Dorin

Characters

id name species gender height weight hair_color eye_color skin_color year_born homeworld year_died description
1 Luke Skywalker Human Male 1.72 77 Blond Blue Light 19 Tatooine 34 The main protagonist of the original trilogy.
2 Leia Organa Human Female 1.50 49 Brown Brown Light 19 Alderaan 35 A leader in the Rebel Alliance and twin sister of Luke Skywalker.
3 Darth Vader Human Male 2.02 136 None Yellow Pale 41 Tatooine 4 The Sith Lord formerly known as Anakin Skywalker.
4 Yoda Yoda's species Male 0.66 17 White Brown Green 896 Unknown 4 A wise and powerful Jedi Master.
5 Han Solo Human Male 1.80 80 Brown Hazel Light 29 Corellia 34 A smuggler turned hero in the Rebel Alliance.
6 Chewbacca Wookiee Male 2.28 112 Brown Blue Brown 200 Kashyyyk
A loyal Wookiee and co-pilot of the Millennium Falcon.
7 Obi-Wan Kenobi Human Male 1.82 81 White Blue-gray Light 57 Stewjon 0 A Jedi Master who mentored Anakin and Luke Skywalker.
8 Emperor Palpatine Human Male 1.73 75 None Yellow Pale 82 Naboo 35 The Sith Lord who became Emperor of the galaxy.
9 Anakin Skywalker Human Male 1.88 84 Blond Blue Light 41 Tatooine 4 A Jedi Knight who fell to the dark side and became Darth Vader.
10 Padmé Amidala Human Female 1.65 45 Brown Brown Light 46 Naboo 19 The queen-turned-senator of Naboo and wife of Anakin Skywalker.

Films

id title release_date director producer opening_crawl
1 Episode I: The Phantom Menace 36299 George Lucas Rick McCallum Turmoil has engulfed the Galactic Republic...
2 Episode II: Attack of the Clones 37392 George Lucas Rick McCallum There is unrest in the Galactic Senate...
3 Episode III: Revenge of the Sith 38491 George Lucas Rick McCallum War! The Republic is crumbling under attacks...
4 Episode IV: A New Hope 28270 George Lucas Gary Kurtz, Rick McCallum It is a period of civil war...
5 Episode V: The Empire Strikes Back 29362 Irvin Kershner Gary Kurtz, Rick McCallum It is a dark time for the Rebellion...
6 Episode VI: Return of the Jedi 30461 Richard Marquand Howard G. Kazanjian, George Lucas, Rick McCallum Luke Skywalker has returned to his home planet...
7 Episode VII: The Force Awakens 42356 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Bryan Burk Luke Skywalker has vanished...
8 Episode VIII: The Last Jedi 43084 Rian Johnson Kathleen Kennedy, Ram Bergman The FIRST ORDER reigns...
9 Episode IX: The Rise of Skywalker 43819 J.J. Abrams Kathleen Kennedy, J.J. Abrams, Michelle Rejwan The dead speak!...
10 Rogue One: A Star Wars Story 42720 Gareth Edwards Kathleen Kennedy, Allison Shearmur, Simon Emanuel The Rebellion makes a risky move...
11 Solo: A Star Wars Story 43245 Ron Howard Kathleen Kennedy, Allison Shearmur, Simon Emanuel Board the Millennium Falcon and journey to a galaxy far, far away...

Cleaning

  • clean_names (cleaning) [mess up names from kaggle species - spongebob meme]
  • make_clean_names (cleaning) [kaggle species data set - change average to mus]
    • mu_to_u

get_dupes()

Shows the duplicated rows according to a certain column (or multiple columns).

characters_messy |> 
  get_dupes(name) |> 
  select(name, dupe_count, species, height, weight, homeworld) |> 
  print(n = Inf)
# A tibble: 28 × 6
   name                 dupe_count species height weight homeworld
   <chr>                     <int> <chr>    <dbl>  <dbl> <chr>    
 1 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 2 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 3 Chirrut Îmwe                  3 Human     1.75     70 Jedha    
 4 Saw Gerrera                   3 Human     1.78     75 Onderon  
 5 Saw Gerrera                   3 Human     1.78     75 Onderon  
 6 Saw Gerrera                   3 Human     1.78     75 Onderon  
 7 Baze Malbus                   2 Human     1.85     95 Jedha    
 8 Baze Malbus                   2 Human     1.85     95 Jedha    
 9 Bodhi Rook                    2 Human     1.8      75 Jedha    
10 Bodhi Rook                    2 Human     1.8      75 Jedha    
11 Cassian Andor                 2 Human     1.84     75 Fest     
12 Cassian Andor                 2 Human     1.84     75 Fest     
13 Ezra Bridger                  2 Human     1.75     70 Lothal   
14 Ezra Bridger                  2 Human     1.75     70 Lothal   
15 Galen Erso                    2 Human     1.83     75 Grange   
16 Galen Erso                    2 Human     1.83     75 Grange   
17 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
18 Grand Admiral Thrawn          2 Chiss     1.95     85 Csilla   
19 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
20 Hera Syndulla                 2 Twi'lek   1.78     55 Ryloth   
21 Jyn Erso                      2 Human     1.7      60 Vallt    
22 Jyn Erso                      2 Human     1.7      60 Vallt    
23 Kanan Jarrus                  2 Human     1.91     86 Coruscant
24 Kanan Jarrus                  2 Human     1.91     86 Coruscant
25 Orson Krennic                 2 Human     1.85     80 Lexrul   
26 Orson Krennic                 2 Human     1.85     80 Lexrul   
27 Sabine Wren                   2 Human     1.65     50 Mandalore
28 Sabine Wren                   2 Human     1.65     50 Mandalore

convert_to_date()

  • Can handle mixed types (“2024-10-01” [character] vs 344402 [numeric])
  • Will handle Excel-specific date numbers (through excel_numeric_to_date())
films_messy |> 
  mutate(release_date_clean = convert_to_date(release_date),
         release_data_lub = lubridate::as_date(release_date)) |> 
  select(title, release_date, release_date_clean, release_data_lub)
# A tibble: 11 × 4
   title                        release_date release_date_clean release_data_lub
   <chr>                               <dbl> <date>             <date>          
 1 Episode I: The Phantom Mena…        36299 1999-05-19         2069-05-20      
 2 Episode II: Attack of the C…        37392 2002-05-16         2072-05-17      
 3 Episode III: Revenge of the…        38491 2005-05-19         2075-05-21      
 4 Episode IV: A New Hope              28270 1977-05-25         2047-05-27      
 5 Episode V: The Empire Strik…        29362 1980-05-21         2050-05-23      
 6 Episode VI: Return of the J…        30461 1983-05-25         2053-05-26      
 7 Episode VII: The Force Awak…        42356 2015-12-18         2085-12-19      
 8 Episode VIII: The Last Jedi         43084 2017-12-15         2087-12-17      
 9 Episode IX: The Rise of Sky…        43819 2019-12-20         2089-12-21      
10 Rogue One: A Star Wars Story        42720 2016-12-16         2086-12-18      
11 Solo: A Star Wars Story             43245 2018-05-25         2088-05-26      

next

  • get_dupes (cleaning) [kaggle characters - Saw Gurerra is duplicated]
  • convert_to_date (cleaning) [kaggle films - make sure dates are excel dates]
    • mention excel conversions specifically
  • remove_empty (cleaning) [make empty column to remove]
  • remove_constant (cleaning) [make constant column to remove]

Tidying

functions

  • round_half_up [kaggle characters - height]
  • round_to_fraction [kaggle characters - height]
  • signif_half_up [kaggle characters - height]
  • compare_df_cols (checking) [split kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]
  • compare_df_cols_same (checking) [split kaggle characters into humans vs non-human and mess one up, then use these to check before row binding]

tabyl()

tabyl() Basics

tabyl() is halfway between table() and gtsummary::cross_tab()

## one variable
characters_messy |> 
  tabyl(death_status)
 death_status  n percent
        alive 36   0.375
         dead 60   0.625


## two variables
characters_messy |> 
  tabyl(death_status, bmi)
 death_status BMI High BMI Low NA_
        alive       11      22   3
         dead       26      33   1

adorn_*()

Functions to make your “tabyls” more readable:

  • adorn_title() – add column name to the top of a two-way table
  • adorn_totals() – add row/column totals to a tabyl
  • adorn_percentages() – converts tabyl of counts to percentages


Functions that come after using adorn_percentages()

  • adorn_pct_formatting() – formatting decimals as percentages
  • adorn_ns() – add underlying Ns to a tabyl with percentages
  • adorn_rounding() – round numeric columns

Adorn with totals and titles

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_totals() |> 
  adorn_title()
                   bmi        
 death_status BMI High BMI Low
        alive       11      22
         dead       26      33
        Total       37      55

Adorn with percentages

characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting()
 death_status BMI High BMI Low
        alive    33.3%   66.7%
         dead    44.1%   55.9%


characters_messy |> 
  tabyl(death_status, bmi, show_na = FALSE) |> 
  adorn_percentages() |> 
  adorn_pct_formatting() |> 
  adorn_ns()
 death_status   BMI High    BMI Low
        alive 33.3% (11) 66.7% (22)
         dead 44.1% (26) 55.9% (33)

FYI

  • untabyl – when would we want to do this??????